Filtering Web Pages with Application Ontologies

نویسندگان

  • Li Xu
  • David W. Embley
چکیده

Automatically recognizing which HTML documents on the Web contain objects that are “of interest” for a user is non-trivial. As a step toward solving this problem for information filtering in which a user expresses a long-term need for information with a specific profile, we propose an approach based on ontological descriptions. The HTML documents we consider include semistructured HTML documents, HTML tables, and HTML forms. Given the keywords and values and kinds of values recognized by an ontological specification in an HTML document, we apply several heuristics: (1) a density heuristic that measures the percent of the document that appears to apply to the application ontology, (2) an expected-value heuristic that compares the number and kind of values found in the document to the number and kind expected by the application ontology, and (3) a grouping heuristic that considers whether the values of the document appear to be grouped as application-ontology records. Then, based on machinelearned rules over these heuristic measurements, we determine whether an HTML document contains objects of interest with respect to an application ontology. Our experimental results show that we have been able to achieve about 95% for both recall and precision.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Use of Semantic Similarity and Web Usage Mining to Alleviate the Drawbacks of User-Based Collaborative Filtering Recommender Systems

  One of the most famous methods for recommendation is user-based Collaborative Filtering (CF). This system compares active user’s items rating with historical rating records of other users to find similar users and recommending items which seems interesting to these similar users and have not been rated by the active user. As a way of computing recommendations, the ultimate goal of the user-ba...

متن کامل

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

متن کامل

Analyzing new features of infected web content in detection of malicious web pages

Recent improvements in web standards and technologies enable the attackers to hide and obfuscate infectious codes with new methods and thus escaping the security filters. In this paper, we study the application of machine learning techniques in detecting malicious web pages. In order to detect malicious web pages, we propose and analyze a novel set of features including HTML, JavaScript (jQuery...

متن کامل

Dynamic Ontologies on the Web

We discuss the problems associated with managing ontologies in distributed environments such as the Web. The Web poses unique problems for the use of ontologies because of the rapid evolution and autonomy of web sites. We present SHOE, a web-based knowledge representation language that supports multiple versions of ontologies. We describe SHOE in the terms of a logic that separates data from on...

متن کامل

Ontologies Come of Age

Ontologies have moved beyond the domains of library science, philosophy, and knowledge representation. They are now the concerns of marketing departments, CEOs, and mainstream business. Research analyst companies such as Forrester Research report on the critical roles of ontologies in support of browsing and search for e-commerce and in support of interoperability for facilitation of knowledge ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2005